Recent advances in coarse-grained lattice and off-lattice protein models are reviewed. The sequence dependence
of thermodynamical folding properties is investigated, and evidence for non-randomness of the binary sequences
of good folders is discussed. Similar patterns of non-randomness are found for real proteins. Dynamical
parameter MC methods, such as the tempering and multisequence algorithms, are essential in order to obtain
these results. Also, a new MC method for design, the inverse of folding, is presented. Here, one maximizes
conditional probabilities rather than minimizing energies. By construction, this method ensures that the designed
sequences represent good folders thermodynamically.
1. Introduction

Proteins are heterogeneous chain molecules composed of sequences of amino acids, of which
there are 20 different kinds. The protein folding problem amounts to predicting the 3D
structure of a protein given its sequence of amino acids. In the bioinformatics approach
one aims at extracting rules in a "black-box" manner by relating sequence to structure
from databases. Here we pursue the physics approach, where, given interaction energies,
the 3D structures and their thermodynamical properties are probed. In principle, this can
be pursued on different levels of resolution. Ab initio quantum chemistry calculations
cannot handle the huge number of degrees of freedom, but are of course useful for
estimating interatomic potentials. All-atom representations, where the atoms are the
building blocks, also require very large computing resources for the full folding problem
including thermodynamics, but are profitable for computing partial problems, binding
energies, etc.

Here we pursue a coarse-grained representation, where the entities are the amino acids.
This is motivated by the fact that the hydrophobic properties of the amino acids play a
most important role in the folding process: the amino acids that are hydrophobic (H) tend
to form a core, whereas the hydrophilic or polar ones (P) are attracted to the surrounding
H2O solution. In such representations, the interactions between the amino acids and the
solvent are reformulated as an effective interaction between the amino acids.

2. Coarse-Grained Models 

Both lattice and off-lattice models have here 
been studied. 

A well-studied lattice model is the HP model,

$$E(r,\sigma) = -\sum_{i<j} \sigma_i \sigma_j \Delta(r_i - r_j) \tag{1}$$

where $\Delta(r_i - r_j) = 1$ if monomers $i$ and $j$ are non-bonded nearest neighbors
and 0 otherwise. For hydrophobic and polar monomers, one has $\sigma_i = 1$ and 0,
respectively. Being discrete, this model has the advantage that for sizes up to $N = 18$
in 2D it can be solved exactly by exhaustive enumeration.
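
As an illustration of Eq. (1) and of exhaustive enumeration, the following is a minimal sketch in Python, not the code used for the results reported here. The function names (`hp_energy`, `conformations`, `ground_state`) are hypothetical, and the symmetry reduction (fixing the first bond and the direction of the first turn) is one possible convention among several.

```python
# Unit steps on the 2D square lattice.
STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))


def hp_energy(coords, sigma):
    """Eq. (1): minus the number of non-bonded nearest-neighbour H-H pairs."""
    energy = 0
    n = len(coords)
    for i in range(n):
        for j in range(i + 2, n):  # j = i + 1 is a bonded pair, so it is skipped
            dx = abs(coords[i][0] - coords[j][0])
            dy = abs(coords[i][1] - coords[j][1])
            if dx + dy == 1:  # Delta(r_i - r_j) = 1
                energy -= sigma[i] * sigma[j]
    return energy


def conformations(n):
    """Yield all self-avoiding walks of n >= 2 monomers, one per symmetry class.

    Rotations are removed by fixing the first bond to (1, 0); reflections are
    removed by requiring the first step off the x-axis to go upward.
    """
    def extend(path, occupied, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            if not turned and dy == -1:
                continue  # mirror image of a walk that turns upward
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            path.append(nxt)
            occupied.add(nxt)
            yield from extend(path, occupied, turned or dy != 0)
            path.pop()
            occupied.remove(nxt)

    start = [(0, 0), (1, 0)]
    yield from extend(start, set(start), False)


def ground_state(sigma):
    """Return (lowest energy, number of conformations attaining it)."""
    energies = [hp_energy(c, sigma) for c in conformations(len(sigma))]
    e_min = min(energies)
    return e_min, energies.count(e_min)
```

For example, `ground_state((1, 0, 0, 1))` enumerates the five symmetry-inequivalent conformations of a 4-mer. Pure Python becomes slow well before $N = 18$, so this sketch is meant for small chains only.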

Similarly, off-lattice models have been developed, where adjacent residues are linked by
rigid bonds of unit length to form linear chains. The energy function is given by

$$E(r,\sigma) = \sum_i F_i + \ldots \tag{2}$$

where $F_i$ is a local sequence-independent interaction chosen to mimic the observed
local correlations among real proteins, and the second term corresponds to amino-acid
interactions, the strengths/signs of which are governed by $\epsilon(\sigma_i,\sigma_j)$.
3. Folding

Investigating thermodynamical properties of chains given by Eqs. (1) and (2) is extremely
tedious with standard MC methods such as Metropolis, the hybrid method, etc. Hence novel
approaches are called for. Dynamical parameter approaches, namely the tempering and
multisequence methods, have here turned out to be very powerful. In both approaches one
enlarges the Gibbs distribution. In the tempering method one simulates

$$P(r,k) = \frac{1}{Z} \exp(-g_k - E(r,\sigma)/T_k) \tag{3}$$

with ordinary $r$ and $k$ updates for $T_1 < \ldots < T_K$, regularly quenching the system
to the ground state. The weights $g_k$ are chosen such that the probability of visiting
the different $T_k$ is roughly constant. Similarly, in the multisequence method the
degrees of freedom are enlarged to include different sequences according to

$$P(r,\sigma) = \frac{1}{Z} \exp(-g_\sigma - E(r,\sigma)/T) \tag{4}$$

where again $g_\sigma$ is a set of tunable parameters, and the sequences $\sigma$ are
subject to moves jointly with $r$.
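
The tempering update can be sketched for a toy system with a handful of discrete states. This is an illustrative sketch under stated assumptions, not the implementation behind the results above: the state space, the uniform proposal in $r$, the nearest-neighbour proposal in $k$, and all function names are hypothetical choices. Only the acceptance rule follows directly from Eq. (3).

```python
import math
import random


def accept_temperature_move(energy, k, k_new, temps, g):
    """Metropolis acceptance probability for k -> k_new at fixed conformation.

    Follows from Eq. (3): P(r, k) is proportional to exp(-g_k - E / T_k).
    """
    log_ratio = (-g[k_new] - energy / temps[k_new]) - (-g[k] - energy / temps[k])
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def flat_weights(energies, temps):
    """g_k = ln Z_k, which makes P(k) uniform for an exactly solvable toy system."""
    return [math.log(sum(math.exp(-e / t) for e in energies)) for t in temps]


def tempering_run(energies, temps, g, n_steps, seed=0):
    """Sample (r, k) from Eq. (3); states are r = 0 .. len(energies) - 1.

    Returns the number of visits to each temperature index.
    """
    rng = random.Random(seed)
    r, k = 0, 0
    visits = [0] * len(temps)
    for _ in range(n_steps):
        # Ordinary Metropolis update of r at the current temperature T_k.
        r_new = rng.randrange(len(energies))
        d_e = energies[r_new] - energies[r]
        if d_e <= 0 or rng.random() < math.exp(-d_e / temps[k]):
            r = r_new
        # Update of k to a neighbouring temperature (rejected if out of range).
        k_new = k + rng.choice((-1, 1))
        if 0 <= k_new < len(temps):
            if rng.random() < accept_temperature_move(energies[r], k, k_new, temps, g):
                k = k_new
        visits[k] += 1
    return visits
```

Since the marginal distribution of Eq. (3) is $P(k) \propto \exp(-g_k) Z_k$, the choice $g_k = \ln Z_k$ gives equal visiting probabilities. In a realistic model $Z_k$ is unknown, so the $g_k$ would have to be estimated, for instance from trial runs.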

When estimating thermodynamical quanti- 
ties, these dynamical parameter methods yield 
speedup factors of several orders of magnitude. 

A key issue when studying properties of protein models is to what extent different
sequences yield structures with good folding properties from a thermodynamic standpoint.
Defining good folding properties is straightforward in the lattice model case:
non-degenerate ground states. For off-lattice models a suitable measure can be defined
in terms of the mean-square distance $\delta^2_{ab}$ between two arbitrary configurations
$a$ and $b$. An informative measure of stability is the mean $\langle\delta^2\rangle$.
With a suitable cut on $\langle\delta^2\rangle$, good folders are singled out. For both
lattice and off-lattice models, only a few % of the sequences have good folding
properties.$^1$ When analyzing the sequence properties of good folders, one finds that
similar signatures occur among real proteins when using a binary coding for the
hydrophobicities. One might speculate that only those sequences with good folding
properties survived evolution.

$^1$ Similar fractions are obtained within the replica approach for lattice models.

4. Design 

The "inverse" of protein folding, sequence optimization, is of utmost relevance in the
context of drug design. Here, one aims at finding optimal amino acid sequences given a
target structure $r_0$ such that the solution represents a good folder. This corresponds
to maximizing the conditional probability

$$P(r_0|\sigma) = \frac{1}{Z(\sigma)} \exp(-E(r_0,\sigma)/T) \tag{5}$$

$$Z(\sigma) = \sum_r \exp(-E(r,\sigma)/T) \tag{6}$$

Note that here $Z(\sigma)$ is not a constant quantity. A straightforward approach would
therefore require a nested MC: for each step in $\sigma$ a complete MC has to be
performed in $r$. Needless to say, this is extremely time consuming. Various
approximations for $Z(\sigma)$ have been suggested, such as chemical potentials fixing
the net hydrophobicity and low-$T$ expansions. Neither of these produces good folders in
a reliable way.
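
To make the role of $Z(\sigma)$ concrete, here is a minimal sketch of Eqs. (5) and (6) for a system small enough that the sum over $r$ can be done exactly. The energy table and the sequence labels "A" and "B" are made-up illustrative values, and the function names are hypothetical.

```python
import math


def partition_function(energies, T):
    """Eq. (6): Z(sigma) = sum over conformations r of exp(-E(r, sigma) / T)."""
    return sum(math.exp(-e / T) for e in energies)


def target_probability(energies, r0, T):
    """Eq. (5): P(r0 | sigma), given the energies E(r, sigma) for all r."""
    return math.exp(-energies[r0] / T) / partition_function(energies, T)


def design_by_enumeration(energy_table, r0, T):
    """Return the sequence maximizing P(r0 | sigma).

    energy_table maps each sequence label to the list of its energies E(r, sigma).
    """
    return max(energy_table,
               key=lambda seq: target_probability(energy_table[seq], r0, T))


# Hypothetical toy data: both sequences have the same energy in the target
# conformation r0 = 0, but sequence "B" has a competing degenerate state.
toy = {"A": [-1.0, 0.0, 0.0], "B": [-1.0, -1.0, 0.0]}
best = design_by_enumeration(toy, r0=0, T=0.5)
```

In this toy case minimizing $E(r_0,\sigma)$ alone cannot distinguish the two sequences, whereas $P(r_0|\sigma)$ favors "A" because its $Z(\sigma)$ is smaller.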

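Here we devise a different strategy based upon the multisequence method. The starting
point is the joint probability distribution (Eq. (4)). The corresponding marginal
distribution is given by

$$P(\sigma) = \sum_r P(r,\sigma) = \frac{1}{Z} \exp(-g_\sigma) Z(\sigma), \qquad Z = \sum_\sigma \exp(-g_\sigma) Z(\sigma) \tag{7}$$

With the choice

$$g_\sigma = -E(r_0,\sigma)/T \tag{8}$$

one has $P(r_0,\sigma) = 1/Z$ for all $\sigma$, and one obtains

$$P(r_0|\sigma) = \frac{P(r_0,\sigma)}{P(\sigma)} = \frac{1}{Z P(\sigma)} \tag{9}$$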



In other words, maximizing $P(r_0|\sigma)$ is in this case equivalent to minimizing
$P(\sigma)$. This implies that bad sequences are visited more frequently than good ones
in the simulation. This property may seem strange at first glance. However, it can be
used to eliminate bad sequences. The situation is illustrated in Fig. 1. Basically, one
runs an MC in both $r$ and $\sigma$ using all (or a subset of) the sequences. Regularly,
one estimates $P(\sigma)$. Sequences where $P(\sigma)$ exceeds a certain threshold are
then eliminated, thereby purifying the sample towards designing sequences according to
Eq. (9). For lattice models one can use an alternative to eliminating high-$P(\sigma)$
sequences, by removing sequences for which some $r$ with $E(r,\sigma) < E(r_0,\sigma)$
is found.

Figure 1. The distribution $P(r,\sigma)$. The choice of $g_\sigma$ (Eq. (8)) implies that
$P(r_0,\sigma)$ is flat in $\sigma$. Sequences not designing $r_0$ have maxima in
$P(r|\sigma)$ for $r \neq r_0$ due to states with $E(r,\sigma) < E(r_0,\sigma)$.
Sequences designing $r_0$ have unique maxima at $r = r_0$ in $P(r|\sigma)$, which for
low $T$ contain most of the probability.
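
The elimination idea can be sketched on a toy energy table. This is a hedged illustration, not the implementation used for the tests reported here: the uniform proposals, the threshold (the visit count expected under a uniform $P(\sigma)$), the toy energies, and all function names are assumptions made for the example. The function `exact_marginal` checks Eq. (9) by direct summation, which is only possible for very small systems.

```python
import math
import random


def exact_marginal(energy_table, r0, T):
    """P(sigma) of Eq. (7) with g_sigma = -E(r0, sigma) / T, plus the constant Z."""
    weights = {
        seq: sum(math.exp((e[r0] - e_r) / T) for e_r in e)
        for seq, e in energy_table.items()
    }
    z = sum(weights.values())
    return {seq: w / z for seq, w in weights.items()}, z


def design_by_elimination(energy_table, r0, T, rounds, steps, seed=0):
    """Joint MC in (r, sigma); after each round, drop over-visited sequences."""
    rng = random.Random(seed)
    alive = list(energy_table)
    n_conf = len(energy_table[alive[0]])

    def log_weight(r, seq):
        # log of exp(-g_sigma - E(r, sigma) / T) with g_sigma from Eq. (8)
        e = energy_table[seq]
        return (e[r0] - e[r]) / T

    def accept(delta):
        return delta >= 0 or rng.random() < math.exp(delta)

    r, seq = r0, alive[0]
    for _ in range(rounds):
        if len(alive) == 1:
            break
        visits = dict.fromkeys(alive, 0)
        for _ in range(steps):
            r_new = rng.randrange(n_conf)  # conformation move at fixed sequence
            if accept(log_weight(r_new, seq) - log_weight(r, seq)):
                r = r_new
            seq_new = rng.choice(alive)  # sequence move at fixed conformation
            if accept(log_weight(r, seq_new) - log_weight(r, seq)):
                seq = seq_new
            visits[seq] += 1
        # Assumed threshold: the visit count expected if P(sigma) were uniform.
        cut = steps / len(alive)
        alive = [s for s in alive if visits[s] <= cut]
        if seq not in alive:
            seq = alive[0]
    return alive


# Hypothetical toy data: "B" has a state degenerate with the target r0 = 0.
toy = {"A": [-1.0, 0.0, 0.0], "B": [-1.0, -1.0, 0.0]}
p_sigma, z = exact_marginal(toy, r0=0, T=0.5)
survivors = design_by_elimination(toy, r0=0, T=0.5, rounds=3, steps=20000)
```

In this example the poorly designing sequence "B" has the larger $P(\sigma)$, so it is visited more often and is the one removed.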

Testing any design algorithm requires that one has access to designable structures, i.e.
structures for which there exist good folding sequences. Furthermore, after the design
process, it must be verified that the designed sequence indeed has the target structure
as a stable minimum (good folder). For $N \le 18$ 2D lattice models this is of course
feasible, since these models can be enumerated exactly. For larger lattice models and
off-lattice models this is not the case, and testing the design approach is more
laborious.

Extensive tests have been performed for $N$=16, 18, 32 and 50 lattice and $N$=16 and 20
off-lattice chains, respectively. For systems exceeding $N$=20 one cannot go through all
possible sequences. Hence a bootstrap procedure has been devised, where a set of
preliminary runs with subsets of sequences is first performed. Positions along the chain
with clear assignments of H or P are then clamped, and the remaining degrees of freedom
are run with all sequences visited. Without exception, the design algorithm efficiently
singles out sequences that fold well into the (designable) target structures.

Acknowledgment: The results reported here were obtained together with A. Irbäck,
F. Potthast, E. Sandelin and O. Sommelius.